In this project we will explore the data systematically rather than answer one specific question.
- Name: result.csv
- Definition: Ford GoBike System - Data
- Source: https://www.fordgobike.com/system-data
- Version: Files from 01.2018 to 02.2019
#Import important libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
import zipfile
import requests
import geopy.distance
from sklearn.cluster import KMeans
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
At first we need to gather the data. The code below will download, unzip and merge the data together to a final *.csv file.
#define filenames to download
year_data = [x for x in range(201801, 201813)] + [x for x in range(201901, 201903)]
#make sure the target folder exists
os.makedirs("./Data", exist_ok=True)
#loop over all year-month codes
for year in year_data:
    #url structure
    url = f"https://s3.amazonaws.com/fordgobike-data/{year}-fordgobike-tripdata.csv.zip"
    #get response
    response = requests.get(url)
    #save file into ./Data, where the next cell expects it
    with open(f"./Data/{year}-fordgobike-tripdata.csv.zip", mode = "wb") as file:
        file.write(response.content)
#https://stackoverflow.com/questions/3451111/unzipping-files-in-python/3451150
#define file names
files = [x for x in os.walk("./Data")][0][2]
#loop over file names and unzip every archive
for x in files:
    if ".zip" in x:
        with zipfile.ZipFile(f"./Data/{x}", 'r') as zip_ref:
            zip_ref.extractall("./Data")
month_range = [int(x[:6]) for x in files if ".zip" not in x] #can be exchanged with year_data ..
#read in the first file
result = pd.read_csv(f"./Data/{month_range[0]}-fordgobike-tripdata.csv")
len_df = len(result)
#loop over all files and append data one by one
for i, month in enumerate(month_range):
    if i + 2 > len(month_range):
        break
    df_append = pd.read_csv(f"./Data/{month_range[i+1]}-fordgobike-tripdata.csv")
    len_df += len(df_append)
    assert all(result.columns == df_append.columns)
    result = pd.concat([result, df_append])
    print(f"{month_range[i+1]}")
result.to_csv("result.csv", index = False)
print("Done")
print(len_df == len(result))
del len_df, df_append, result, files, month_range
Now that we have our final "result.csv" file, we have to make sure that the data is ready for analysis.
df = pd.read_csv("result.csv")
df.info()
Here we can already see that the start_time and end_time columns are not datetime objects. Also, the IDs and member_birth_year should be of object type.
df.head()
ms.matrix(df);
It seems that we also have a problem with missing data.
df[df.start_station_id.isnull()].head(10)
df[df.start_station_id.isnull()].tail(10)
All the rows with missing data have similar coordinates, and these coordinates are less precise than those of the entries with no missing data.
df[df.start_station_id.isnull()].start_station_latitude.value_counts()
df[df.start_station_id.isnull()].start_station_longitude.value_counts()
The most frequent coordinates are at 37.41 latitude and -121.94 longitude.
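The precision difference can be quantified by counting decimal places in the coordinate strings. A quick sketch; the commented line at the end is an assumed application to the dataframe:

```python
def decimal_places(value):
    """Number of digits after the decimal point in the string form of a value."""
    text = f"{value}"
    return len(text.split(".")[1]) if "." in text else 0

# Coarse coordinates like the ones on the missing-station rows:
print(decimal_places(37.41))      # 2 decimal places -> roughly 1 km precision
# Typical station coordinates carry far more digits:
print(decimal_places(37.770083))  # 6 decimal places -> sub-meter precision

# Assumed usage on the main frame:
# df[df.start_station_id.isnull()].start_station_latitude.map(decimal_places).value_counts()
```

Two decimal places of latitude correspond to roughly a kilometer, which matches the observation that these points do not pin down a station.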
fig, ax = plt.subplots(figsize = (10,5))
sns.scatterplot(data = df[df.start_station_id.isnull()], x = "start_station_longitude", y = "start_station_latitude", alpha = 0.15, s = 200)
sns.scatterplot(data = df.dropna(subset=["start_station_id"]).sample(50000), x = "start_station_longitude", y = "start_station_latitude", alpha = 0.15, s = 200)
plt.xlim(-121.8,-122)
plt.ylim(37.3,37.45)
plt.tight_layout()
We can see that these 'NaN' stations mostly lie outside the range of the other stations.
#all ids are a start- and ending point
start_station_id_list = list(df.start_station_id.drop_duplicates().dropna().astype("int"))
start_station_id_list.sort()
end_station_id_list = list(df.end_station_id.drop_duplicates().dropna().astype("int"))
end_station_id_list.sort()
start_station_id_list == end_station_id_list
df_station_names = df[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude"]].copy()
df_station_names.rename(columns={"end_station_id": "id",
"end_station_name": "station_name",
"end_station_latitude": "station_latitude",
"end_station_longitude": "station_longitude"}, inplace = True)
df_station_names.drop_duplicates(inplace = True)
df_station_names.dropna(inplace = True)
df_station_names.sort_values("id", inplace = True)
df_station_names.reset_index(inplace = True, drop = True)
fig, ax = plt.subplots(figsize = (10,5))
#sns.scatterplot(data = df[df.start_station_id.isnull()], x = "start_station_longitude", y = "start_station_latitude", alpha = 0.15, s = 200)
sns.scatterplot(data = df_station_names, x = "station_longitude", y = "station_latitude", s = 200)
plt.tight_layout()
First, we can see three clear clusters in this dataset. The Ford GoBike website distinguishes the following zones: "San Francisco, East Bay, San José".
df_station_names.id.value_counts()[df_station_names.id.value_counts()>1]
Here we can see something interesting: one ID can refer to multiple different stations.
df_station_names[df_station_names.duplicated("id", keep = False)]
It is possible that some of these stations changed position over time and/or were renamed. This is a data consistency problem, because the IDs are not unique.
df_station_names.drop_duplicates(subset = ["id", "station_latitude", "station_longitude"]).id.value_counts().head(25) #excludes 4 cases
If we filter out the stations that only got a new name without changing position, we can exclude 4 cases. The remaining cases are still worth investigating.
df_consistency = df[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude"]].copy().merge(df_station_names, left_on = "end_station_id", right_on = "id", how = "outer")
df_consistency_issues = df_consistency.drop_duplicates().dropna()
df_stat_dist = df_consistency_issues[(df_consistency_issues.end_station_latitude != df_consistency_issues.station_latitude) &
                                     (df_consistency_issues.end_station_longitude != df_consistency_issues.station_longitude)].copy()
#https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude/43211266#43211266
def calculate_coord_dist(col):
    #geodesic distance in meters between (lat1, lon1) and (lat2, lon2)
    coords_1 = (col[0], col[1])
    coords_2 = (col[2], col[3])
    return geopy.distance.distance(coords_1, coords_2).m
df_stat_dist["dist_in_m"] = df_stat_dist[["end_station_latitude", "end_station_longitude", "station_latitude", "station_longitude"]].apply(calculate_coord_dist, axis = 1)
df_stat_dist_data = df_stat_dist.drop_duplicates("dist_in_m").sort_values("dist_in_m", ascending = False)
df_stat_dist_data
We can see that the distance between relocated stations ranges from 0 to 364 meters (and nearly 4 km in the worst case). This needs to be part of the data cleaning. Now we will look at the overall numeric structure of this dataset.
df.describe().round(2)
Nothing too obvious here. There seems to be an outlier at the maximum duration_sec.
df.query("duration_sec == 86366")
Maybe somebody forgot to return the bike.
df.query("duration_sec > 80000")
Or some users really do rent the bikes for that long. For the final part of this assessment we will look at the missing values in the birth_year column.
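Converting the maximum duration into a timedelta supports the forgotten-return idea: 86366 seconds is just short of 24 hours, which could hint at a daily cutoff (a guess, not confirmed by the data source):

```python
import pandas as pd

# The maximum trip duration observed above, expressed as a timedelta:
longest = pd.Timedelta(seconds=86366)
print(longest)                          # 0 days 23:59:26
print(longest < pd.Timedelta(days=1))   # True: just under a full day
```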
df[df.member_birth_year.isnull()].head()
It is possible that users are not required to enter their birth date during registration, or that they chose not to share this information at all.
#there are multiple ways how to handle this - but since the coordinates are not precise we will just drop them
df.dropna(subset = ["start_station_id"], inplace = True)
#fill up the missing values
df.member_birth_year.fillna(0, inplace = True)
df.member_gender.fillna("not defined", inplace = True)
Test #1
ms.matrix(df)
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])
for col in ["start_station_id", "end_station_id"]:
    df[col] = df[col].astype("int")
for col in ["start_station_id", "end_station_id", "member_birth_year", "bike_id"]:
    df[col] = df[col].astype("str")
Test #2
df.info()
On the one hand we could ignore most of these cases, because the different stations are relatively close to each other. But that raises the question: how close is close? Another approach is to give the IDs a new "subindex": every time the coordinates of a station ID differ from its previous occurrence, we increase the subindex by 1, provided the calculated distance is greater than some threshold x.
df.sort_values("start_time", inplace = True)
df.reset_index(inplace = True, drop = True)
df.head()
#id_data = {}
def get_new_id(col):
    #col = (id, name, latitude, longitude)
    #when the id is not in id_data yet - store it with subindex 0
    if col[0] not in id_data:
        id_data[col[0]] = [col[1], col[2], col[3], f"{col[0]}_0"]
        return id_data[col[0]][3]
    #if the id exists, check whether the coordinates changed; if they moved more
    #than 100 m, increase the subindex and replace the saved coordinates in
    #id_data with the new ones, otherwise just return the saved id
    if id_data[col[0]][1] != col[2] or id_data[col[0]][2] != col[3]:
        coords_1 = (id_data[col[0]][1], id_data[col[0]][2])
        coords_2 = (col[2], col[3])
        if geopy.distance.distance(coords_1, coords_2).m > 100:
            #split on "_" so subindexes above 9 are handled correctly
            new_ind = f"{col[0]}_{int(id_data[col[0]][3].split('_')[-1]) + 1}"
            id_data[col[0]][3] = new_ind
            id_data[col[0]][1] = col[2]
            id_data[col[0]][2] = col[3]
            return new_ind
    return id_data[col[0]][3]
id_data = {}
df["start_station_id_new"] = df[["start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude"]].apply(get_new_id, axis = 1)
id_data = {}
df["end_station_id_new"] = df[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude"]].apply(get_new_id, axis = 1)
df_station_names.head()
df_station_names = df[["start_time", "start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude"]].copy()
df_station_names.sort_values("start_time", inplace = True)
df_station_names.drop("start_time", axis = 1, inplace = True)
df_station_names.rename(columns={"start_station_id": "id",
"start_station_name": "station_name",
"start_station_latitude": "station_latitude",
"start_station_longitude": "station_longitude"}, inplace = True)
df_station_names.drop_duplicates(inplace = True)
df_station_names.dropna(inplace = True)
id_data = {}
df_station_names["new_id"] = df_station_names[["id", "station_name", "station_latitude", "station_longitude"]].apply(get_new_id, axis = 1)
df_station_names.sort_values("id", inplace = True)
df_station_names.reset_index(inplace = True, drop = True)
Test #3
df.query("start_station_id == '208'").drop_duplicates("start_station_id_new")[["start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude", "start_station_id_new"]]
df.query("end_station_id == '208'").drop_duplicates("end_station_id_new")[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude", "end_station_id_new"]]
#export the station names to csv
df_station_names.to_csv("df_station_names.csv", index = False)
#export the cleaned csv
df.to_csv("result_clean.csv", index = False)
Now that the cleaning part is done, we can start to visualize the data.
df = pd.read_csv("result_clean.csv")
df.info()
#wrong datatypes again, maybe change the datatype to HDF5
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])
for col in ["member_birth_year"]:
    df[col] = df[col].astype("int")
for col in ["start_station_id", "end_station_id", "member_birth_year", "bike_id"]:
    df[col] = df[col].astype("str")
df.info()
df.head()
df_station_names = pd.read_csv("df_station_names.csv")
df_station_names.head()
Before we start visualizing, we can extract some additional information from the data to improve the insights. We saw that the data points can be separated into three clusters. Since these clusters are very distinct, we can classify them efficiently using the k-means clustering algorithm.
#https://datatofish.com/k-means-clustering-python/
kmeans = KMeans(n_clusters=3).fit(df_station_names[["station_longitude", "station_latitude"]])
df_station_names["label"] = kmeans.labels_
for x in set(list(df_station_names.label)):
    df_plot_cluster = df_station_names.query(f"label == {x}")
    plt.scatter(df_plot_cluster['station_longitude'], df_plot_cluster["station_latitude"], s=50, alpha=0.5, label = x)
plt.legend();
Now that this is done, we can visualize the data on a map. This could be done with Bokeh or Plotly, but we will use kepler.gl from the visualization toolbox for map visualizations.
Source: kepler.gl

Now we will map the labels to the original names.
#note: k-means label numbering can vary between runs - verify against the cluster plot above
mapping = {0: "San Francisco", 1: "San José", 2: "East Bay"}
df_station_names["label_name"] = df_station_names["label"].map(mapping)
df_station_names.head()
df_station_names.label_name.value_counts()
df_station_names[df_station_names.duplicated("new_id", keep = False)] # duplicates for new coordinates which are not > 100 m away from the origin
df_station_names.drop_duplicates(subset = ["new_id"], inplace = True)
df_station_names.info()
df[df.duplicated(subset = ["start_time", "end_time"], keep = False)]
df.info()
Now we can merge the labels to the id's of the main dataframe.
df = df.merge(df_station_names[["new_id", "label"]], left_on = "start_station_id_new", right_on = "new_id", how = "outer")
df.info()
df[df.label.isnull()].start_station_id_new.value_counts()
df.query("start_station_id == '205'").drop_duplicates("start_station_id_new")[["start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude", "start_station_id_new", "label"]]
df["label_name"] = df["label"].map(mapping)
df.label_name.value_counts()
for x in set(list(df.label)):
    df_plot_cluster = df.query(f"label == {x}")
    plt.scatter(df_plot_cluster['start_station_longitude'], df_plot_cluster["start_station_latitude"], s=50, alpha=0.5, label = x)
plt.legend();
Now we will extract several other features from the birth year and the start_time timestamp.
df["age"] = df["member_birth_year"].apply(lambda x: 2018 - int(x))
df['month_year'] = pd.to_datetime(df["start_time"]).dt.to_period('M')
df['day_month_year'] = pd.to_datetime(df["start_time"]).dt.to_period('D')
df["dayofweek"] = df["start_time"].apply(lambda x: x.dayofweek)
df["start_hr"] = df["start_time"].apply(lambda x: x.hour)
df["end_hr"] = df["end_time"].apply(lambda x: x.hour)
df_age = df.query("age != 2018 and age < 100").copy()
df_age.age.min()
bins = [x for x in range(10,101, 10)]
df_age["age_bins"] = pd.cut(df_age.age, bins = bins, precision = 0, include_lowest=False)
df_age[["age", "age_bins"]].head()
df.head()
Now we are really ready for the exploratory visualizations.
df.info()
There are multiple interesting variables in this dataset. Let's start with the stations first.
#https://codeyarns.com/2015/06/29/how-to-hide-axis-of-plot-in-matplotlib/
value_ct = df.start_station_id_new.value_counts().iloc[:25]
fig, ax = plt.subplots(figsize = (22,5), dpi = 80)
color = sns.color_palette("viridis")[1]
sns.countplot(x = "start_station_id_new", data = df, order=value_ct.index, color = color);
plt.ylim(0,70000)
cur_axes = plt.gca()
#cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()-4000), color = "white")
plt.title("Top 25 Start Stations");
plt.xlabel("Start Station ID");
Overall we can see that 67_0, 15_0 and 58_0 are the most used stations in this dataset. Let's take a look at each group separately.
for x in value_ct.index:
    print(x + " - " + str(df.query(f"start_station_id_new == '{x}'").start_station_name.drop_duplicates().values))
#https://codeyarns.com/2015/06/29/how-to-hide-axis-of-plot-in-matplotlib/
df_new = df.query("label == 0").copy()
value_ct = df_new.start_station_id_new.value_counts().iloc[:25]
fig, ax = plt.subplots(figsize = (22,5), dpi = 80)
color = sns.color_palette("viridis")[1]
sns.countplot(x = "start_station_id_new", data = df_new, order=value_ct.index, color = color);
plt.ylim(0,50000)
cur_axes = plt.gca()
#cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()-2000), color = "white")
plt.title("Top 25 Start Stations for San Francisco");
plt.xlabel("Start Station ID");
We can see that San Francisco leads the overall trip count.
#https://codeyarns.com/2015/06/29/how-to-hide-axis-of-plot-in-matplotlib/
df_new = df.query("label == 1").copy()
value_ct = df_new.start_station_id_new.value_counts().iloc[:25]
fig, ax = plt.subplots(figsize = (22,5), dpi = 80)
color = sns.color_palette("viridis")[1]
sns.countplot(x = "start_station_id_new", data = df_new, order=value_ct.index, color = color);
plt.ylim(0,50000)
cur_axes = plt.gca()
#cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()-2000), color = "white")
plt.title("Top 25 Start Stations for San José");
plt.xlabel("Start Station ID");
#https://codeyarns.com/2015/06/29/how-to-hide-axis-of-plot-in-matplotlib/
df_new = df.query("label == 2").copy()
value_ct = df_new.start_station_id_new.value_counts().iloc[:25]
fig, ax = plt.subplots(figsize = (22,5), dpi = 80)
color = sns.color_palette("viridis")[1]
sns.countplot(x = "start_station_id_new", data = df_new, order=value_ct.index, color = color);
plt.ylim(0,50000)
cur_axes = plt.gca()
#cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()-2000), color = "white")
plt.title("Top 25 Start Stations for East Bay");
plt.xlabel("Start Station ID");
It seems like the Ford GoBike program is relatively new in East Bay and San José; these areas have fewer trips than San Francisco. In East Bay the stations 182_0, 243_0 and 176_0 are popular; for San José the top three are 310_0, 296_0 and 312_0. Now that we know that San Francisco is the city with the most active users in this project, we will take a look at the trip duration.
#maybe a customer forgot to log off
bin_size = 100
bins = np.arange(0,df.duration_sec.max()+bin_size,bin_size)
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
plt.hist(df.duration_sec, bins = bins, color= color);
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.xlim(-100,6000);
plt.tight_layout()
This graph is limited to 6000 seconds to exclude the outliers. The main takeaway is that most trips have a duration between 250 and 550 seconds.
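The 250-550 second claim can be checked by locating the tallest histogram bin directly; a minimal sketch on synthetic durations (with the real data one would pass df.duration_sec):

```python
import numpy as np

def modal_bin(durations, bin_size=100):
    """Return the (start, end) edges of the most frequent histogram bin."""
    bins = np.arange(0, max(durations) + bin_size, bin_size)
    counts, edges = np.histogram(durations, bins=bins)
    i = counts.argmax()
    return edges[i], edges[i + 1]

# Synthetic example; real usage would be modal_bin(df.duration_sec):
print(modal_bin([120, 340, 360, 380, 410, 900]))  # (300.0, 400.0)
```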
df.query("duration_sec < 6000").duration_sec.mean()
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
for x in mapping.values():
    df_new = df.query(f"label_name == '{x}'")
    bin_size = 100
    bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)
    plt.hist(df_new.duration_sec, bins = bins, label = x, histtype='step')
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,6000);
plt.tight_layout()
These distributions look similar to each other (right-skewed), although trips in East Bay seem to be slightly shorter in duration.
for x in mapping.values():
    print(x, df.query(f"label_name == '{x}' and duration_sec < 6000").duration_sec.mean())
The mean values also agree on that.
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
df_new = df.query("label_name == 'San Francisco'")
bin_size = 100
bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)
plt.hist(df_new.duration_sec, bins = bins, label = "San Francisco", histtype='step', color = "g");
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,6000);
plt.tight_layout()
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
df_new = df.query("label_name == 'East Bay'")
bin_size = 100
bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)
plt.hist(df_new.duration_sec, bins = bins, label = "East Bay", histtype='step');
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,6000);
plt.tight_layout()
fig, axes = plt.subplots(figsize = (12,5), dpi = 110)
df_new = df.query("label_name == 'San José'")
bin_size = 100
bins = np.arange(0,df_new.duration_sec.max()+bin_size,bin_size)
plt.hist(df_new.duration_sec, bins = bins, label = "San José", histtype='step', color = "orange");
plt.xticks(ticks = [x for x in range(0,7000,250)])
plt.legend()
plt.xlim(-100,6000);
plt.tight_layout()
Now we will look at the user structure.
fig, ax = plt.subplots(figsize = (10,5), dpi = 80)
sns.countplot(x = "member_gender", data = df, order=df.member_gender.value_counts().index, palette = "viridis");
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.303, p.get_height()+50000))
plt.title("Users By Gender");
plt.xlabel("");
It seems that far more men use this service than women, "not defined" or other.
fig, ax = plt.subplots(figsize = (10,5), dpi = 80)
sns.countplot(x = "label_name", data = df, order=df.label_name.value_counts().index, palette = "viridis", hue = "member_gender");
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
#for p in ax.patches:
# ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.01, p.get_height()+50000))
plt.title("Users By Gender");
plt.xlabel("");
for x in mapping.values():
    print(x, df.query(f"label_name == '{x}'").member_gender.value_counts() / df.query(f"label_name == '{x}'").member_gender.count())
Looking at the plot and the relative frequencies, the male percentage is > 50% for all three areas.
value_ct = df.user_type.value_counts().iloc[:31]
fig, ax = plt.subplots(figsize = (10,5), dpi = 80)
sns.countplot(x = "user_type", data = df, order=value_ct.index, palette = "viridis");
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.31, p.get_height()+40000))
plt.title("Users By Type");
plt.xlabel("");
value_ct = df.label_name.value_counts().iloc[:31]
fig, ax = plt.subplots(figsize = (10,5), dpi = 80)
sns.countplot(x = "label_name", data = df, order=value_ct.index, palette = "viridis", hue = "user_type");
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left = True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.08, p.get_height()+40000))
plt.title("Users By Type");
plt.xlabel("");
There are also far more trips by subscribers than by casual customers.
fig, ax = plt.subplots(figsize = (20,5), dpi = 80)
color = sns.color_palette("viridis")[2]
sns.countplot(x = "age", data = df.query("age != 2018 and age < 73").sort_values("age"), color = color);
df.query("age != 2018").age.mean()
Overall, the average user is between 24 and 35 years old, with no user being younger than 18 (and giving their birth year data).
fig, ax = plt.subplots(figsize = (20,5), dpi = 80)
color = sns.color_palette("viridis")[2]
sns.countplot(x = "age", data = df.query("age != 2018 and age < 73 and label == 2").sort_values("age"), color = 'orange', label = "San Francisco");
sns.countplot(x = "age", data = df.query("age != 2018 and age < 73 and label == 0").sort_values("age"), color = 'g', label = "East Bay");
sns.countplot(x = "age", data = df.query("age != 2018 and age < 73 and label == 1").sort_values("age"), color = 'b', label = "San José");
plt.legend();
The East Bay age structure is broader than San Francisco's, and San José has the youngest user group on average.
for x in mapping.values():
    print(x, df.query(f"label_name == '{x}' and age != 2018").age.mean())
For the next plots we will focus on the time components of the data. First we will explore on which days people like to go on trips; 0 refers to Monday while 6 refers to Sunday.
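To keep the numeric codes readable, they can be mapped to day names; this follows pandas' `dayofweek` convention (0 = Monday). The `df` line at the end is an assumed usage, not part of the original pipeline:

```python
day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
dow_mapping = dict(enumerate(day_names))
print(dow_mapping[0], dow_mapping[6])  # Mon Sun

# Assumed usage on the main frame:
# df["day_name"] = df["dayofweek"].map(dow_mapping)
```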
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "dayofweek", data = df, palette = "viridis");
Based on this graph it looks like users use the bikes more during the week than during the weekend.
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "label_name", data = df, palette = "viridis", hue = "dayofweek");
#https://stackoverflow.com/questions/43855474/changing-sort-in-value-counts
for x in mapping.values():
    print(f"{x}\n", df.query(f"label_name == '{x}'").dayofweek.value_counts().sort_index() / df.query(f"label_name == '{x}'").dayofweek.count(), "\n")
This applies for all three areas. Tuesday, Wednesday and Thursday are the most active days, followed by Monday and Friday and then Saturday and Sunday. Now let's look on the trips per month/year.
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "month_year", data = df, palette = "viridis");
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "month_year", data = df, palette = "viridis", hue = "label_name");
For all three areas we can see that bike sharing became more and more popular from 2018-01 to 2018-10, followed by a drop during November and December 2018.
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (20,5))
sns.countplot(x = "label_name", data = df, palette = "viridis", hue = "month_year");
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (20,5))
sns.countplot(x = "month_year", data = df.query("label_name == 'San José'"), palette = "viridis");
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (20,5))
sns.countplot(x = "month_year", data = df.query("label_name == 'East Bay'"), palette = "viridis");
For the San José area the drop lasted one month longer, until February 2019.
#https://seaborn.pydata.org/generated/seaborn.color_palette.html
sns.palplot(sns.color_palette("viridis"))
#https://stackoverflow.com/questions/35143672/seaborn-conditional-colors-based-on-value
custom_palette = {}
for q in set(df.day_month_year):
    if q.dayofweek <= 5:
        #Monday through Saturday take the first six viridis colors
        custom_palette[q] = sns.color_palette("viridis")[q.dayofweek]
    elif q.dayofweek == 6:
        custom_palette[q] = (224/255, 228/255, 65/255)
    else:
        custom_palette[q] = 'g'
legend_obj = []
colors = [sns.color_palette("viridis")[0],
sns.color_palette("viridis")[1],
sns.color_palette("viridis")[2],
sns.color_palette("viridis")[3],
sns.color_palette("viridis")[4],
sns.color_palette("viridis")[5],
(224/255, 228/255, 65/255)]
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for i, s in enumerate(days):
    legend_obj.append(plt.scatter([], [], color = colors[i]))
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
def plot_data_time(start_date="12.31.17", end_date="03.01.18"):
    fig, ax = plt.subplots(figsize = (30,5), dpi = 100)
    sns.countplot(x = "day_month_year", data = df[(df["start_time"] < pd.to_datetime(end_date)) & (df["start_time"] > pd.to_datetime(start_date))], palette = custom_palette)
    plt.xticks(rotation = 90)
    plt.ylim(0,10000)
    box = ax.get_position()
    ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
    # put a legend to the right of the current axis
    ax.legend(legend_obj, days, loc='center left', bbox_to_anchor=(1, 0.5))
plot_data_time()
plot_data_time("03.01.18","05.01.18")
plot_data_time("05.01.18","07.01.18")
plot_data_time("07.01.18","09.01.18")
plot_data_time("09.01.18","11.01.18")
plot_data_time("11.01.18","01.01.19")
plot_data_time("01.01.19","03.01.19")
These plots also show the increasing count of trips per day, and the trend that there are fewer trips on weekends than on weekdays. We can also see drops in trip count, for example on 2018-11-23 or 2019-01-06.
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "start_hr", data = df, palette = "viridis", ax = ax);
The most frequent starting hours are 8 and 17. Maybe people use the bikes before and after work, which would make sense, because we have a lot of subscribers of working age in our dataset. You only subscribe to something if you want to use it regularly; integration into working/study life would make sense here.
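The commute hypothesis could be probed by comparing how concentrated each user type's trips are in the two peak hours. A sketch on a toy frame; the column names mirror the dataset, the numbers are made up:

```python
import pandas as pd

# Toy stand-in for df[["user_type", "start_hr"]]
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Customer"],
    "start_hr":  [8, 17, 12, 8, 17],
})

# Share of each user type's trips starting in the 8:00 or 17:00 peak
peak = trips.start_hr.isin([8, 17])
peak_share = trips[peak].groupby("user_type").size() / trips.groupby("user_type").size()
print(peak_share)
```

If subscribers really are commuters, their peak share should come out clearly higher than the customers' share on the real data.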
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "start_hr", data = df.query("label == 0"), palette = "viridis", ax = ax);
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "start_hr", data = df.query("label == 1"), palette = "viridis", ax = ax);
For San José the trend leans more towards hour 17.
fig, ax = plt.subplots(figsize = (16,5))
sns.countplot(x = "start_hr", data = df.query("label == 2"), palette = "viridis", ax = ax);
As we saw, the most frequent hours are 8 and 17. I want to see how this trend changes over time, so in the next plot we will connect these two counts with a line and judge the trend by the slope.
fig, ax = plt.subplots(14, 1, figsize = (16,50))
dates = [pd.to_datetime(f"2018-{x}") for x in range(1,13)] + [pd.to_datetime(f"2019-{x}") for x in range(1,3)]
for i, date in enumerate(dates):
    try:
        sns.countplot(x = "start_hr", data = df[(df['start_time'] < dates[i+1]) & (df['start_time'] >= date)], palette = "viridis", ax = ax[i])
    except IndexError:
        #the last month has no upper bound
        sns.countplot(x = "start_hr", data = df[(df['start_time'] > dates[i])], palette = "viridis", ax = ax[i])
    ax[i].set_ylim(0,25000)
    ax[i].set_xlim(-0.5,23.5)
    ax[i].plot([8,17], [ax[i].patches[8].get_height(), ax[i].patches[17].get_height()], 'o-', color = "#3a3a3a")
    ax[i].text(0, 22000, str(date.date()))
We can see that this is an ongoing trend: for most months, slightly more trips start at 17:00 than at 8:00.
for i, date in enumerate(dates):
    try:
        print(str(df[(df['start_time'] < dates[i+1]) & (df['start_time'] >= date)].start_hr.value_counts().head(2)))
    except IndexError:
        pass
Now we are going to dig deeper into the data searching for relationships and trends between variables.
df.head()
#join the trips together
df["combi"] = df["start_station_id_new"] + " - " + df["end_station_id_new"]
df.combi.value_counts().head(10)
One interesting insight is the most frequent trips. Based on the output above we can see that station 6_0 appears often. A heatmap should make this visible.
combi_list = df.combi.value_counts().head(15).keys()
df_criteria = df[df["combi"].isin(combi_list)]
df_pivot = df_criteria.pivot_table(index="start_station_id_new", columns = "end_station_id_new", values = "start_time", aggfunc = "count", fill_value = 0)
fig, axes = plt.subplots(figsize = (15,10), dpi = 70)
sns.heatmap(df_pivot, annot = True, cmap = "viridis_r", fmt='g', vmin = 1, vmax = 6000, mask= df_pivot == 0)
Here we can see the top 15 routes.
for i, x in enumerate(combi_list):
    start, end = x.split(" - ")
    start_name = df.query(f"start_station_id_new == '{start}'").start_station_name.drop_duplicates().values
    end_name = df.query(f"end_station_id_new == '{end}'").end_station_name.drop_duplicates().values
    print(start, " ", start_name[0], " - ", end, " ", end_name[0])
combi_list = df.combi.value_counts().sample(20).keys()
df_criteria = df[df["combi"].isin(combi_list)]
df_pivot = df_criteria.pivot_table(index="start_station_id_new", columns = "end_station_id_new", values = "start_time", aggfunc = "count", fill_value = 0)
fig, axes = plt.subplots(figsize = (15,10), dpi = 80)
sns.heatmap(df_pivot, annot = True, cmap = "viridis_r", fmt='g', vmin = 1, vmax = 6000, mask= df_pivot == 0)
df.head()
#creating the legend object for the next plot
legend_obj = []
colors = list(sns.color_palette("viridis")) + [(163/255, 199/255, 70/255)]
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for i, s in enumerate(days):
    legend_obj.append(plt.scatter([],[], color = colors[i]));
Another interesting combination is the day of the week and the average duration. For this we will create a groupby object (mean) over each month-year combination. A box plot should be appropriate to visualize this relationship.
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "dayofweek", y = "duration_sec", data = df.groupby(["dayofweek", "month_year"], as_index = False).mean(), palette = "viridis")
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(legend_obj, days, loc='center left', bbox_to_anchor=(1, 0.5))
We can clearly see that during the week users go on shorter trips (~780 seconds), while the average duration on the weekend rises to ~1200 seconds.
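The weekday/weekend gap can also be checked directly with a plain groupby. A minimal sketch on synthetic data; the values below are made up, only the column names `dayofweek` and `duration_sec` match the notebook:

```python
import pandas as pd

#synthetic sample mirroring the notebook's columns (hypothetical values)
df_demo = pd.DataFrame({
    "dayofweek": ["Mon", "Tue", "Wed", "Sat", "Sun", "Sat"],
    "duration_sec": [700, 760, 800, 1150, 1300, 1210],
})

#flag weekend rows and compare the average durations
df_demo["is_weekend"] = df_demo["dayofweek"].isin(["Sat", "Sun"])
means = df_demo.groupby("is_weekend")["duration_sec"].mean()
print(means)
```

On the real `df` the same two-line groupby yields the weekday vs. weekend averages without any plotting.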
#https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "dayofweek", y = "duration_sec", data = df.groupby(["dayofweek", "month_year", "label_name"], as_index = False).mean(), palette = "viridis", hue = "label_name")
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
This trend applies to all areas. We can also see that users in San Francisco have, on average, the longest trip durations, followed by East Bay and then San José. But what about the average duration based on age?
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "age_bins", y = "duration_sec", data = df_age.groupby(["age_bins", "month_year"], as_index = False).mean(), palette = "viridis")
plt.ylim(250,1500)
The longest durations are found in the 10 to 20 years bin. After this bin the average duration drops, then increases slightly until the 60 to 70 bin. Next we should explore whether the duration also differs by starting hour.
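The `age_bins` column used above is not constructed in this chunk; one plausible way to build it is `pd.cut` over the rider's age. A sketch with made-up birth years (`member_birth_year` is the column name in the raw trip data):

```python
import pandas as pd

#hypothetical birth years; the raw data stores member_birth_year
df_demo = pd.DataFrame({"member_birth_year": [2000, 1985, 1960, 1995]})

#age at the end of the observed period (2019), binned into decades
df_demo["age"] = 2019 - df_demo["member_birth_year"]
df_demo["age_bins"] = pd.cut(df_demo["age"],
                             bins=list(range(10, 81, 10)),
                             labels=["10-20", "20-30", "30-40", "40-50",
                                     "50-60", "60-70", "70-80"])
print(df_demo[["age", "age_bins"]])
```

The bin edges and labels here are assumptions for illustration; the notebook's actual binning may differ.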
df_age.to_csv("df_age.csv")
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "start_hr", y = "duration_sec", data = df.groupby(["start_hr", "month_year"], as_index = False).mean(), palette = "viridis")
For the hours 0 to 3 we may be dealing with outliers. On the other hand, we saw earlier that not many trips start at that time, so long trips have a stronger impact than at 8 o'clock, for example. From 5 to 9 trips are relatively short (~600 seconds), then the average rises to ~950 seconds from 10 to 15. From 16 to 20 it drops again to ~700 seconds and finally increases slightly around 22 and 23.
fig, ax = plt.subplots(figsize = (20,10), dpi = 80)
sns.boxplot(x = "start_hr", y = "duration_sec", data = df.groupby(["start_hr", "month_year", "label_name"], as_index = False).mean(), palette = "viridis", hue = "label_name")
Here the hours 0 to 4 become even less clear. This graph confirms again that San Francisco has, on average, the longest trip durations, while users in East Bay and San José tend to take shorter trips.
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "dayofweek", y = "start_hr", data = df.groupby(["dayofweek", "month_year"], as_index = False).mean(), palette = "viridis")
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(legend_obj, days, loc='center left', bbox_to_anchor=(1, 0.5)) #legend is redundant here: the x-axis already shows the days
This graph shows that, on average, people start their trips later on the weekend than during the week.
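The later weekend starts can be quantified without a plot, for example by averaging `start_hr` per day. A minimal sketch on made-up data, matching the column names used above:

```python
import pandas as pd

#hypothetical start hours per day of week
df_demo = pd.DataFrame({
    "dayofweek": ["Mon", "Mon", "Mon", "Sat", "Sat", "Sun", "Sun"],
    "start_hr": [8, 9, 17, 11, 14, 12, 13],
})

#average starting hour per day; weekend days come out later than Monday here
mean_start = df_demo.groupby("dayofweek")["start_hr"].mean()
print(mean_start)
```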
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "dayofweek", y = "start_hr", data = df.groupby(["dayofweek", "month_year", "label_name"], as_index = False).mean(), palette = "viridis", hue = "label_name")
Looking at each area is interesting: users from East Bay and San José not only have shorter trip durations on average, but also start their trips later than users in San Francisco.
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "month_year", y = "start_hr", data = df.groupby(["month_year","dayofweek"], as_index = False).mean(), palette = "viridis")
This shows that, on average, users start later in the middle of the year than at the beginning or the end of the year.
fig, ax = plt.subplots(figsize = (15,10), dpi = 80)
sns.boxplot(x = "month_year", y = "start_hr", data = df.groupby(["month_year","dayofweek", "label_name"], as_index = False).mean(), palette = "viridis", hue = "label_name")
For the last exploratory visualization we will take a look at the trips with kepler.gl.
df_station_names.head()
df_trips = df[["start_station_latitude", "start_station_longitude", "end_station_latitude", "end_station_longitude","start_station_id_new", "end_station_id_new"]].copy()
df_trips["cnt"] = 1
df_trips_grp = df_trips.groupby([x for x in df_trips.columns[:-1]], as_index = False).sum().sort_values("cnt", ascending = False)
df_trips_grp.head()
df_trips_grp.to_csv("grps.csv", index = False)
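The per-area kepler.gl views that follow keep only routes above a trip-count threshold; that filter amounts to a one-liner on the `cnt` column. A sketch with made-up routes (station ids and counts are hypothetical):

```python
import pandas as pd

#hypothetical aggregated routes with a trip count per route
df_routes = pd.DataFrame({
    "start_station_id_new": ["6_0", "6_1", "15_2"],
    "end_station_id_new": ["6_3", "6_0", "15_4"],
    "cnt": [2400, 850, 120],
})

#keep only the busiest routes, e.g. more than 1000 trips for San Francisco
busy = df_routes[df_routes["cnt"] > 1000]
print(busy)
```

Lowering the threshold (500 for East Bay, 200 for San José) works the same way.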
First we will look at San Francisco. To keep the map readable, the visualization only contains routes with more than 1000 trips:

We can see that most of the trips are close to the beach. Now for East Bay with routes with more than 500 trips:

Here the main routes are much more spread out than in San Francisco. It also looks like people use this service to quickly cover smaller distances. For San José we will look at routes with more than 200 trips.

For San José the trips look spread across most of the stations.
From our exploration we can note that: